You will use the mutate() function from the
{dplyr} package to create new variables or modify existing
variables.
You will create new numeric, character, factor, boolean, and date variables.
You will use the case_when() function from the
{dplyr} package to create new variables based on conditions.
You will learn an additional handy argument of the
mutate() function: .keep.
You will mutate across multiple columns using the
across() function from the {dplyr} package.
In this lesson, we will again use the data from the COVID-19 serological survey conducted in Yaounde, Cameroon.
library(tidyverse) # provides read_csv(), the %>% pipe and the {dplyr} verbs
yaounde <- read_csv(here::here('ch04_data_wrangling/data/yaounde_data.csv'))
## a smaller subset of variables
yao <- yaounde %>% select(date_surveyed,
                          age,
                          weight_kg, height_cm,
                          symptoms, is_smoker)
yao

See the previous lesson for more information about this data.
mutate()
Fig: the mutate() function. (Drawing adapted from
Allison Horst)
We use dplyr::mutate() to create new variables or modify
existing variables. The general structure is:
df %>% mutate(new_column_name = what_it_contains)
dplyr::mutate() is THE tool for changing a variable in
place. Here we are converting our height in centimeters to a height in
meters. We choose to make a new variable to have a name that reflects
this conversion, but we could have very well done an in-place change,
such as: mutate(height_cm = height_cm/100).
yao %>%
  mutate(height_m = height_cm/100)

# let's save this variable for later:
yao <- yao %>%
  mutate(height_m = height_cm/100)

dplyr::mutate() is also the tool at your disposal to
create an entirely new variable useful for your analysis, such as an
index or a ranking. Here is an example of an index. The expression
1:n() uses the : operator to build the sequence from 1 to the
end of the index, where the end is obtained with n(), which counts the
number of entries (i.e. rows) in the yao dataset.
yao %>%
  mutate(record_number = 1:n())

If you would like to create a rank, the function dplyr::min_rank()
is at your disposal. The function dplyr::desc()
sets the ranking in descending order.
yao %>%
  mutate(rank = min_rank(desc(weight_kg)))

Convert the weight in kilograms into grams. Call your new column
weight_g and submit your dataframe with your new
column.
You can easily create a boolean variable to categorize part of your
population. Here we create a boolean variable, child, which
is TRUE if the subject is a child (aged 18 or younger) or
FALSE if the subject is an adult.
yao %>%
  mutate(child = age <= 18)

Create a boolean variable for the symptoms variable: set a condition
based on No symptoms. You should name the column
symptoms_boolean. If someone had no symptoms, they should
be set to 0, and if they had any symptom, they should be set to 1.
A common health indicator is the body mass index (BMI), which helps categorize a person’s overall health by indicating whether the person is underweight, of normal weight, or overweight.
\[ BMI = \frac{weight (kilograms)}{height (meters)^2} \]
yao %>%
  mutate(BMI = (weight_kg / (height_m)^2))

# Let's keep this variable for later!
yao <-
yao %>%
  mutate(BMI = (weight_kg / (height_m)^2))

You can also add columns together, like so:
mutate(z = x + y), or take the logarithm
of a variable, like so: mutate(z = log(z)). Basically, you can
get almost any column transformation done with dplyr::mutate()!
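As a small illustration of the two transformations just mentioned, here is a sketch on a made-up table. The table toy and its columns x and y are hypothetical and are not part of the Yaounde data.

```r
library(dplyr)

# Hypothetical toy data: x and y are made-up columns
toy <- tibble(x = c(1, 10, 100), y = c(2, 20, 200))

toy_sums <- toy %>%
  mutate(z = x + y,        # add two columns together
         log_z = log(z))   # natural log of the column created just above

toy_sums
```

Note that log_z can use z even though z is created in the same mutate() call: the new columns are computed in the order they are written.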
A handy argument for the mutate function: .keep.
.keep, as its name indicates, allows you to decide which variables to keep
or drop after dplyr::mutate(). By default (.keep = "all"), every column is kept.
If you want to keep only the variables involved in mutate, you can
set the .keep argument to "used" (keeping the new
variables and the variables that were used to compute them).
If you want to drop the variables used in mutate, for example the
height and weight used to calculate the BMI, you can set the
.keep argument to "unused" (keep the new variables and all the
variables that were not used within
dplyr::mutate()).
If you want to only keep the new variable created or the variable
you changed using dplyr::mutate(), you can set the
.keep argument to "none" (keep only the result
of dplyr::mutate()).
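To compare the options side by side, here is a minimal sketch on a made-up table. The table toy and its id column are hypothetical; only the column names that survive each call are of interest.

```r
library(dplyr)

# Hypothetical toy data with one column (id) that is not used in the calculation
toy <- tibble(id = 1:2, weight_kg = c(60, 80), height_m = c(1.6, 2.0))

keep_used   <- toy %>% mutate(BMI = weight_kg / height_m^2, .keep = "used")
keep_unused <- toy %>% mutate(BMI = weight_kg / height_m^2, .keep = "unused")
keep_none   <- toy %>% mutate(BMI = weight_kg / height_m^2, .keep = "none")

names(keep_used)    # weight_kg, height_m, BMI (id is dropped)
names(keep_unused)  # id, BMI (weight_kg and height_m are dropped)
names(keep_none)    # BMI only
```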
Let’s try an example, setting .keep to
"unused", i.e. dropping the height_m and weight_kg variables after
creating the BMI variable.
yao %>%
  mutate(BMI = (weight_kg / (height_m)^2), .keep = "unused")

dplyr::case_when()
Fig: the case_when() function. (Drawing adapted from
Allison Horst)
A BMI between 18.5 and 25 is considered healthy: the person has a normal weight.
If the BMI is below 18.5, the person is considered too thin.
If the BMI is above 25 and up to 30, the person is considered overweight.
If the BMI is above 30, the person is considered obese.
# Let's keep this variable for the next part!
yao <-
  yao %>%
  mutate(BMI_classification = case_when(BMI < 18.5 ~ 'Too thin',
                                        BMI >= 18.5 & BMI <= 25 ~ 'Normal weight',
                                        BMI > 25 & BMI <= 30 ~ 'Overweight',
                                        BMI > 30 ~ 'Obese'))
yao

Create a variable called covid19_risk encompassing the
risk factors of smoking and age for COVID-19. Define the profiles (these
are approximations based on the overall medical consensus) as follows,
using case_when:
High risk : a smoker, aged above 70
Moderate risk: an ex-smoker, aged above 70 OR a
smoker, aged between 60 and 70
Low risk : an ex-smoker, aged between 60 and 70 OR a
smoker, aged between 50 and 60
Often in a data analysis, depending on how you read in the data, you may need to do some data processing where you redefine the type of your variable. (Quick example: you may have a number that is written as a string when you want to handle it like a double or an integer.)
The main functions to change a variable type are
as.character(), as.factor(), and
as.integer().
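The quick example mentioned above (a number that was read in as a string) could look like the following sketch. The table toy and its column age_chr are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: ages that were read in as text
toy <- tibble(age_chr = c("25", "40", "31"))

toy_typed <- toy %>%
  mutate(age_int = as.integer(age_chr),  # text to integer
         age_dbl = as.double(age_chr))   # text to double

toy_typed
```

Once converted, the new columns can be used in arithmetic, which is not possible with the original character column.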
as.factor
For easier manipulation of your new variable
BMI_classification, it may be useful to transform it into
a factor. Thankfully, you can do so with as.factor().
yao %>%
  mutate(BMI_classification = as.factor(BMI_classification))

If you want to reorder the levels of your factor variable (for a plot,
for example), you can use fct_relevel() from the {forcats} package.
yao %>%
  mutate(BMI_classification = fct_relevel(BMI_classification,
                                          "Obese", "Overweight",
                                          "Normal weight", "Too thin"))

as.character
Type transformations are flexible and easy. If you would like to
convert BMI_classification back to a string (for example, for
writing text on a plot), then you can simply pass it through
as.character() this time.
yao %>%
  mutate(BMI_classification = as.character(BMI_classification))

as.integer
This flexibility also extends to numeric types. You can easily convert an integer to a double and a double to an integer. Note that as.integer() drops the decimal part rather than rounding.
yao %>%
select(BMI) %>%
mutate(BMI_int = as.integer(BMI),
         BMI_dbl = as.double(BMI_int))

If you want rounded numbers, you can use the round()
function, like: mutate(BMI_round = round(BMI)).
If you apply the as.integer() function to a factor
variable, you get the underlying integer codes of the factor levels (1 for the first level, 2 for the second, and so on), not the labels.
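To see what as.integer() returns for a factor, here is a small sketch on made-up data. The table toy and its column bmi_class are hypothetical; the levels are left in R's default alphabetical order.

```r
library(dplyr)

# Hypothetical toy factor; default levels are sorted alphabetically:
# "Normal weight" (1), "Obese" (2), "Too thin" (3)
toy <- tibble(bmi_class = factor(c("Obese", "Normal weight", "Too thin", "Obese")))

toy_codes <- toy %>%
  mutate(code  = as.integer(bmi_class),    # position of each value's level
         label = as.character(bmi_class))  # the text label itself

toy_codes
```

If you reorder the levels (for example with fct_relevel()), the integer codes change accordingly, so they should not be treated as stable identifiers.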
as.Date
A final function worth mentioning is the as.Date()
function. It takes a string in the format
YYYY-MM-DD (Y: year, M: month, D: day) and makes it a
Date variable. There are numerous advantages to a Date
object, such as being able to compare dates using all the common
operators (<, >, ==,
etc.).
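To illustrate these comparisons, here is a minimal sketch on a few made-up dates. The table toy, its column date_chr and the cut-off date are all hypothetical.

```r
library(dplyr)

# Hypothetical toy data: dates stored as text in YYYY-MM-DD format
toy <- tibble(date_chr = c("2020-10-22", "2020-11-03", "2020-12-15"))

toy_dates <- toy %>%
  mutate(date = as.Date(date_chr),
         after_nov_1 = date > as.Date("2020-11-01"),       # comparison with >
         days_since_first = as.integer(date - min(date)))  # date arithmetic, in days

toy_dates
```

Neither the comparison nor the subtraction would give a meaningful result on the original character column.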
yao %>%
  mutate(date_surveyed = as.Date(date_surveyed))

Convert the is_smoker variable into
a factor. Keep the same column name.
dplyr::across()
Imagine that you want to do some complex string operations on some of
your variables, for further reports or figures. Then maybe you would
find it useful to have all those numbers (weight, height, BMI) as
characters instead of numbers. Excluding your date variable,
date_surveyed, you would apply the
as.character() transformation across all
columns not equal to the date variable: !date_surveyed.
dplyr::across()
is applied following this schema:
across(statement_defining_multiple_columns, function_to_apply_across_all_columns)
The statement defining multiple columns can be:
a vector of column names: c("height_m", "weight_kg", "age") OR c(height_m, weight_kg)
a condition: !sex OR where(is.numeric)
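The different ways of defining the columns can be compared on a made-up table. In this sketch, the table toy and its id column are hypothetical, and round() is just a convenient function to apply.

```r
library(dplyr)

# Hypothetical toy data: one character column and two numeric columns
toy <- tibble(id = c("a", "b"),
              height_m = c(1.6, 1.8),
              weight_kg = c(60.4, 80.6))

by_name    <- toy %>% mutate(across(c(height_m, weight_kg), round))  # explicit names
by_type    <- toy %>% mutate(across(where(is.numeric), round))       # condition on type
by_exclude <- toy %>% mutate(across(!id, round))                     # everything except id

by_name
```

On this table the three statements select the same two columns, so the three results are the same.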
yao %>%
  mutate(across(!date_surveyed, as.character))

There are many predefined R functions that you can use in
dplyr::across(), but once in a while you need to write your
own function.
Imagine you want to normalize the heights and weights of the different participants to use this data for further statistical analysis.
Imagine that you want the values of the distribution X
to be in a 0-1 range: you can write your own min-max normalization
function, applied to each element x of the distribution
X.
\[ x_{normalized} = \frac{x - min(X)}{max(X) - min(X)} \]
The tilde ~ introduces your function. The
.x stands for each of the columns across which you are
applying the function: the function is applied to the variables
one by one.
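One way to read the ~ notation is as a shortcut for writing a small named function. The sketch below, on a made-up table, compares the two forms; the helper name min_max and the table toy are hypothetical.

```r
library(dplyr)

# Hypothetical named helper implementing the min-max formula above
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical toy data
toy <- tibble(height_m = c(1.5, 1.6, 1.7), weight_kg = c(50, 75, 100))

# Same normalization, written once with the named function and once with ~ and .x
with_named  <- toy %>% mutate(across(c(height_m, weight_kg), min_max))
with_lambda <- toy %>% mutate(across(c(height_m, weight_kg),
                                     ~ (.x - min(.x)) / (max(.x) - min(.x))))

with_lambda
```

Keep in mind that min() and max() return NA when a column contains missing values, unless you pass na.rm = TRUE.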
yao %>%
mutate(across(c("height_m", "weight_kg"),
~ (.x - min(.x)) / (max(.x) - min(.x)) ,
.names = "normalized_{.col}"),
         .keep = "unused")

Now let’s normalize the height (in meters), the weight and the age using a mean-standard deviation normalization.
Set the argument .keep to unused and name
the new columns using the .names argument as above
(.names = "mean_std_normalization_{.col}").
The formula consists in normalizing element x of
distribution X using the mean and standard deviation of
X as follows:
\[ x_{normalized} = \frac{x-mean(X)}{std(X)} \]
Just a heads up! This is next level, so look into it and take inspiration from it, but it’s alright if it appears too complex. This code may also be useful for your projects, so feel free to copy-paste it.
You can also mix and match your custom functions and predefined
functions. One example is to use
dplyr::case_when() across multiple
columns.
Imagine that you want to replace the NA values in your
categorical variables (selected with where(is.character)) with
Unknown. For all non-NA entries, you
want to keep the existing value (referenced by .x, as
explained above).
yao %>%
mutate(
across(where(is.character),
~case_when(is.na(.x) ~"Unknown",
!is.na(.x) ~ .x) ,
.names = "unk_{.col}"),
.keep="used") %>%
  count(is_smoker, unk_is_smoker)

The following team members contributed to this lesson:
Some material in this lesson was adapted from the following sources:
Horst, A. (2022). Dplyr-learnr. https://github.com/allisonhorst/dplyr-learnr (Original work published 2020)
Create, modify, and delete columns — Mutate. (n.d.). Retrieved 21 February 2022, from https://dplyr.tidyverse.org/reference/mutate.html
Apply a function (or functions) across multiple columns — Across. (n.d.). Retrieved 21 February 2022, from https://dplyr.tidyverse.org/reference/across.html
Artwork was adapted from: